This tutorial includes three sections. First, it shows how to generate EXON, EPT, EAT and gene count matrix from a BAM file. Then, we run the current-state Seurat pipeline based on the gene expression matrix. This step will tell you how many cells and cell groups we can detect. In the last step, we use Yano to detect the alternative expressed EXONs, EPTs and EATs, and you will also learn how to interpret alternative expressed EPTs with alignment tracks. Before you get started, please ensure you have adequately complied with the following software or package in a standard R and Linux work environment.

1. Preprocess BAMs

We use a BAM file (~14G) of human glioblastoma multiforme from 10X Genomics’ resource page, and a GTF file from Ensembl. The GTF is used to annotate reads and EPTs in this demo. So if you use your own data in the test, please make sure to use the same GTF in whole practice.

Step 1, download the test data.

The following codes run in Terminal.

wget -c https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_SC3v3_Human_Glioblastoma/Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
wget -c https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_SC3v3_Human_Glioblastoma/Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam.bai
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz

Step 2, EPT calling and feature annotation.

The following codes run in Terminal.

These commands tell you how to generate the count files. But you donot need to run these steps now, the processed files already put in the working directory.

# Call expressed peak tags (EPTs) 
PISA callept -tag CB -umi UB -o epts.bed Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
# Annotate EPTs with gene annoation
PISA annobed -gtf gencode.v44.annotation.gtf.gz -o epts_anno.bed -skip-chrs epts.bed

# Annotate reads with EPTs, exon and gene annotation
PISA anno -gtf gencode.v44.annotation.gtf.gz -exon -bed epts.bed -tag EP -o anno.bam Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
# Create EPT and gene expression output directory
mkdir ept
mkdir exp
mkdir exon

# Generate gene X cell counts
PISA count -tag CB -umi UB -outdir exp -anno-tag GN anno.bam

# Generate exon X cell counts
PISA count -tag CB -umi UB -outdir exon -anno-tag EX anno.bam

# Generate EPT X cell counts
PISA count -tag CB -umi UB -outdir ept -anno-tag EP anno.bam

Now you can check if the files generate properly.

head epts_anno.bed
## chr1 14665   14787   .   .   -   1   WASH7P  intron
## chr1 134935  135281  .   .   -   1   ENSG00000268903 exonintron
## chr1 187036  187159  .   .   -   1   WASH9P  exonintron
## chr1 629103  629323  .   .   +   1   MTND1P23    exon
## chr1 629906  630021  .   .   -   1   ENSG00000230021 intron
## chr1 629906  630025  .   .   +   1   MTND2P28    exon
## chr1 631822  631943  .   .   +   1   MTCO1P12    exon
## chr1 632266  632703  .   .   +   1   MTCO1P12    exonintron
## chr1 633103  633310  .   .   +   1   MTCO2P12    exon
## chr1 633365  633463  .   .   +   1   MTCO2P12    exonintron
ls exp/
## barcodes.tsv.gz
## features.tsv.gz
## matrix.mtx.gz
ls exon/
## barcodes.tsv.gz
## features.tsv.gz
## matrix.mtx.gz
ls ept/
## barcodes.tsv.gz
## features.tsv.gz
## matrix.mtx.gz

2. Cell clustering

The following codes run in R.

require(Seurat)
require(Yano)
require(dplyr) ## for processing table
require(DT) ## view table

count <- ReadPISA("./exp/")

# We use 1000 genes as a cutoff to filter valid cells or droplets for convenience.
# Some packages like DropletUtils and Cellbender may perform a more reliable selection.
obj <- CreateSeuratObject(count, min.features = 1000, min.cells = 10)

# We get 5164 raw cells
obj
## An object of class Seurat 
## 24730 features across 5164 samples within 1 assay 
## Active assay: RNA (24730 features, 0 variable features)
# Check the meta data of cells
head(obj@meta.data)
##                       orig.ident nCount_RNA nFeature_RNA
## TTCCACGGTTCGGCTG-1 SeuratProject      30943         6812
## GTGCTTCGTAATGTGA-1 SeuratProject      33492         7048
## CACAGGCAGCTGGTGA-1 SeuratProject      40113         7787
## TGGCGTGGTGCAACGA-1 SeuratProject      30056         6849
## ATCTCTAGTCCATAGT-1 SeuratProject      31092         6879
## AGCGCCACAAGCGATG-1 SeuratProject      12433         4132
# Now check the gene from mitochondria and red blood cells
obj[["percent.mt"]] <- PercentageFeatureSet(obj, pattern = "^MT-")
obj[["percent.hg"]] <- PercentageFeatureSet(obj, pattern = "^HB[ABDEGQZ12]+$")

# Now new meta info added
head(obj@meta.data)
##                       orig.ident nCount_RNA nFeature_RNA percent.mt percent.hg
## TTCCACGGTTCGGCTG-1 SeuratProject      30943         6812   6.980577          0
## GTGCTTCGTAATGTGA-1 SeuratProject      33492         7048   5.380389          0
## CACAGGCAGCTGGTGA-1 SeuratProject      40113         7787   6.160098          0
## TGGCGTGGTGCAACGA-1 SeuratProject      30056         6849   8.933324          0
## ATCTCTAGTCCATAGT-1 SeuratProject      31092         6879   5.940435          0
## AGCGCCACAAGCGATG-1 SeuratProject      12433         4132   2.718572          0
# You can visulise any meta information with VlnPlot
VlnPlot(obj, features = c("nFeature_RNA", "nCount_RNA", "percent.mt", "percent.hg"), ncol = 4)

# Now filter the cell with `abnormal` high gene expression and mito gene expression
obj <- subset(obj, nFeature_RNA < 9000 & percent.mt < 20)

obj
## An object of class Seurat 
## 24730 features across 4973 samples within 1 assay 
## Active assay: RNA (24730 features, 0 variable features)
# We only have the count matrix now, but to perform further analysis we need to normalize
# the data and to remove the bias introduced by sequence depth
obj[['RNA']]@counts[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                 TTCCACGGTTCGGCTG-1 GTGCTTCGTAATGTGA-1 CACAGGCAGCTGGTGA-1
## ENSG00000238009                  .                  .                  1
## ENSG00000268903                  .                  .                  .
## ENSG00000241860                  .                  .                  .
## MTND1P23                         .                  .                  .
## MTND2P28                         .                  .                  .
##                 TGGCGTGGTGCAACGA-1 ATCTCTAGTCCATAGT-1
## ENSG00000238009                  1                  1
## ENSG00000268903                  .                  .
## ENSG00000241860                  .                  .
## MTND1P23                         .                  .
## MTND2P28                         1                  .
# the `data` matrix used to store the normalized expression value, but now it's the same with raw counts
obj[['RNA']]@data[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                 TTCCACGGTTCGGCTG-1 GTGCTTCGTAATGTGA-1 CACAGGCAGCTGGTGA-1
## ENSG00000238009                  .                  .                  1
## ENSG00000268903                  .                  .                  .
## ENSG00000241860                  .                  .                  .
## MTND1P23                         .                  .                  .
## MTND2P28                         .                  .                  .
##                 TGGCGTGGTGCAACGA-1 ATCTCTAGTCCATAGT-1
## ENSG00000238009                  1                  1
## ENSG00000268903                  .                  .
## ENSG00000241860                  .                  .
## MTND1P23                         .                  .
## MTND2P28                         1                  .
# Log scale the gene counts, actually it's log10(counts/cell.size*scale.factor+1).
# Note: some other package like scran may use log10(counts+1)
obj <- NormalizeData(obj)
# The data matrix now updated
obj[['RNA']]@data[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                 TTCCACGGTTCGGCTG-1 GTGCTTCGTAATGTGA-1 CACAGGCAGCTGGTGA-1
## ENSG00000238009                  .                  .            0.22258
## ENSG00000268903                  .                  .            .      
## ENSG00000241860                  .                  .            .      
## MTND1P23                         .                  .            .      
## MTND2P28                         .                  .            .      
##                 TGGCGTGGTGCAACGA-1 ATCTCTAGTCCATAGT-1
## ENSG00000238009          0.2872162          0.2788629
## ENSG00000268903          .                  .        
## ENSG00000241860          .                  .        
## MTND1P23                 .                  .        
## MTND2P28                 0.2872162          .
# Find the highly variable genes for downstream analysis
# We cannot use all genes for dimension reduction and clustering because two points:
# 1) it will be extremely slow 
# 2) some genes may not so informative ..
# ref: 
obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)

# The scale step will scale gene expression with a mean value of 0. 
# Usually only scale the HVGs.
# Some people like to scale all genes, but the scaled matrix is a dense matrix. 
# So it will use a lot of memory and perhaps not necessary
obj <- ScaleData(obj, features = VariableFeatures(obj))
obj[['RNA']]@scale.data[1:5,1:5]
##         TTCCACGGTTCGGCTG-1 GTGCTTCGTAATGTGA-1 CACAGGCAGCTGGTGA-1
## HES4            -0.8057120         0.09033286       0.4755769926
## ISG15           -0.7324449        -0.73244491       0.0005621167
## TNFRSF4         -0.1925974        -0.19259736      -0.1925973591
## SAMD11          -0.1686980         4.90684781      -0.1686980112
## MXRA8           -0.4777915         0.20511186       1.9127692159
##         TGGCGTGGTGCAACGA-1 ATCTCTAGTCCATAGT-1
## HES4           -0.09112569         -0.4150447
## ISG15          -0.35542388          0.5259282
## TNFRSF4        -0.19259736         -0.1925974
## SAMD11         -0.16869801         -0.1686980
## MXRA8           0.27292836         -0.4777915
obj <- RunPCA(obj, features =  VariableFeatures(obj))

# Cluster cell groups
obj <- FindNeighbors(obj, dims = 1:10)
obj <- FindClusters(obj, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4973
## Number of edges: 156733
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9032
## Number of communities: 13
## Elapsed time: 0 seconds
DimPlot(obj, reduction = "pca")

# Reduce feature space to 2D space
obj <- RunUMAP(obj, dims = 1:10)
DimPlot(obj, reduction = "umap")

# label the clusters
DimPlot(obj, label=TRUE, label.size = 10, label.box = TRUE)

markers <- FindAllMarkers(obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
markers %>% group_by(cluster) %>% slice_max(n = 5, order_by = avg_log2FC) %>% datatable(options = list(scrollX = T))
markers %>% group_by(cluster) %>% slice_max(n = 1, order_by = avg_log2FC) %>% pull(gene) %>% unique() -> sel
FeaturePlot(obj, features = sel, ncol = 5)

# GSVA
DefaultAssay(obj) <- "RNA"
mat <- AggregateExpression(obj, assays = "RNA", slot="counts")
markers %>% group_by(cluster) %>% slice_max(n = 100, order_by = avg_log2FC) %>% pull(gene) %>% unique()->sel
mat <- mat$RNA[sel,]
dim(mat)
## [1] 948  13
require(msigdbr) # to load database
require(GSVA)
require(ComplexHeatmap)

kegg.dat <-  msigdbr(species = "Homo sapiens", category = "C2", subcategory = "CP:KEGG") 
kegg.genes <- split(kegg.dat$gene_symbol, kegg.dat$gs_name)

go.dat <- msigdbr(species = "Homo sapiens", category = "C5") %>% filter(gs_subcat != "HPO")
go.genes <- split(go.dat$gene_symbol, go.dat$gs_name)

gsva.go.result <- gsva(expr=mat, gset.idx.list=go.genes, kcdf="Poisson", verbose=FALSE, parallel.sz = 16, mx.diff=1)
Heatmap(gsva.go.result)

gsva.kegg.result <- gsva(expr=mat, gset.idx.list=kegg.genes, kcdf="Poisson", verbose=FALSE, parallel.sz = 16, mx.diff=1)
Heatmap(gsva.kegg.result)

Homework

  • Annotate cell groups.
  • Change resolution to 0.1 and 1 in FindClusters(), and compare the cell groups.
  • Change the dims to 1:5 and 1:20 in RunUMAP(), and compare the umaps.

3. Alternative splicing analysis (exon mode)

The following codes run in R.

exon <- ReadPISA("./exon/")

obj[['EXON']] <- CreateAssayObject(exon[, colnames(obj)], min.cells=30)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
# A new EPT assay now created
names(obj)
## [1] "RNA"     "EXON"    "RNA_nn"  "RNA_snn" "pca"     "umap"
DefaultAssay(obj) <- "EXON"

# Check the meta data for features. Try to understand between meta.features and obj@meta.data
head(obj[['RNA']]@meta.features)
##                    vst.mean vst.variance vst.variance.expected
## ENSG00000238009 0.002614116  0.002607807           0.002740098
## ENSG00000268903 0.012467324  0.013118871           0.014557967
## ENSG00000241860 0.005228232  0.005201944           0.005743268
## MTND1P23        0.007038005  0.006989877           0.007863548
## MTND2P28        0.125477579  0.143544246           0.171295349
## MTCO1P12        0.030363965  0.030252422           0.037907084
##                 vst.variance.standardized vst.variable
## ENSG00000238009                 0.9517202        FALSE
## ENSG00000268903                 0.9011471        FALSE
## ENSG00000241860                 0.9057463        FALSE
## MTND1P23                        0.8888961        FALSE
## MTND2P28                        0.8379927        FALSE
## MTCO1P12                        0.7980678        FALSE
head(obj[['EXON']]@meta.features)
## data frame with 0 columns and 6 rows
obj <- ParseExonName(obj)
## Working on assay EXON
# update exon annotation
head(obj[['EXON']]@meta.features)
##                                       chr  start    end strand       gene_name
## chr1:135141-135895/-/ENSG00000268903 chr1 135141 135895      - ENSG00000268903
## chr1:629062-629433/+/MTND1P23        chr1 629062 629433      +        MTND1P23
## chr1:629640-630683/+/MTND2P28        chr1 629640 630683      +        MTND2P28
## chr1:631074-632616/+/MTCO1P12        chr1 631074 632616      +        MTCO1P12
## chr1:632757-633438/+/MTCO2P12        chr1 632757 633438      +        MTCO2P12
## chr1:633696-634376/+/MTATP6P1        chr1 633696 634376      +        MTATP6P1
obj <- RunAutoCorr(obj, threads = 8)
## Working on assay : EXON
## Build weights on pca
## Run autocorrelation test for 142112 features.
obj <- SetAutoCorrFeatures(obj, moransi.min = 0.1)
## 37627 autocorrelated features.
# This step may take a while; depending on the feature number and cell number, the runtime may range from seconds to hours. 
# The default permutation step is 1000. It's probably too overwhelming. Here, we change perm to 100 to save time.
obj <- RunBlockCorr(obj, block.name = "gene_name", block.assay = "RNA", threads=8, perm=100)
## Working on assay EXON
## Working on 37627 features.
## Build weights on pca
## Processing 8649 blocks..
## Trying to retrieve data from assay RNA..
## Test dissimlarity of two processes ..
## Runtime : 3.734613 mins
# Gene name and type now added
head(obj[['EXON']]@meta.features)
##                                       chr  start    end strand       gene_name
## chr1:135141-135895/-/ENSG00000268903 chr1 135141 135895      - ENSG00000268903
## chr1:629062-629433/+/MTND1P23        chr1 629062 629433      +        MTND1P23
## chr1:629640-630683/+/MTND2P28        chr1 629640 630683      +        MTND2P28
## chr1:631074-632616/+/MTCO1P12        chr1 631074 632616      +        MTCO1P12
## chr1:632757-633438/+/MTCO2P12        chr1 632757 633438      +        MTCO2P12
## chr1:633696-634376/+/MTATP6P1        chr1 633696 634376      +        MTATP6P1
##                                           MoransI MoransI.pval AutoCorrFeature
## chr1:135141-135895/-/ENSG00000268903  0.048799765 4.686871e-16           FALSE
## chr1:629062-629433/+/MTND1P23        -0.003440520 7.030929e-01           FALSE
## chr1:629640-630683/+/MTND2P28         0.055577603 6.064212e-20           FALSE
## chr1:631074-632616/+/MTCO1P12         0.016632877 3.054542e-03           FALSE
## chr1:632757-633438/+/MTCO2P12         0.008546932 7.652688e-02           FALSE
## chr1:633696-634376/+/MTATP6P1         0.527425008 0.000000e+00            TRUE
##                                      gene_name.D gene_name.r gene_name.pval
## chr1:135141-135895/-/ENSG00000268903          NA          NA             NA
## chr1:629062-629433/+/MTND1P23                 NA          NA             NA
## chr1:629640-630683/+/MTND2P28                 NA          NA             NA
## chr1:631074-632616/+/MTCO1P12                 NA          NA             NA
## chr1:632757-633438/+/MTCO2P12                 NA          NA             NA
## chr1:633696-634376/+/MTATP6P1                 NA          NA             NA
# Plot Genome wide feature binding test plot
FbtPlot(obj, val = "gene_name.pval")

obj[['EXON']]@meta.features %>% filter(gene_name.pval < 0.001) %>% datatable(options = list(scrollX = T))
# Normalize Exon expression for featureplot
obj <- NormalizeData(obj)

FeaturePlot(obj, features = c("chr11:123061588-123061833/-/HSPA8", "HSPA8"), order = TRUE, pt.size=1)
## Warning: Could not find HSPA8 in the default search locations, found in RNA
## assay instead

# Prepare GTF database for track plots 
db <- gtf2db("./gencode.v44.annotation.gtf.gz")
## [2023-10-30 00:32:32] GTF loading..
## [2023-10-30 00:33:53] Load 62700 genes.
## [2023-10-30 00:33:53] Load time : 80.858 sec
## Done!
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="HSPA8",cell.group =  Idents(obj),  max.depth=500, highlights = c(123061588,123061833))
## Chromosome chr11, start 123056489, end 123064230
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

4. EPT analysis

The following codes run in R.

names(obj)
## [1] "RNA"     "EXON"    "RNA_nn"  "RNA_snn" "pca"     "umap"
ept <- ReadPISA("./ept")
# Now we create a new ASSAY for the Seurat Object, look this step very carefully, 
# and understand the different between CreateSeuratObject and CreateAssayObject
obj[['EPT']] <- CreateAssayObject(ept[,colnames(obj)], min.cells = 30)

# A new EPT assay now created
names(obj)
## [1] "RNA"     "EXON"    "EPT"     "RNA_nn"  "RNA_snn" "pca"     "umap"
# Check the working assay now
DefaultAssay(obj) 
## [1] "EXON"
# Let's change the assay to EPT
DefaultAssay(obj) <- "EPT"

head(obj[['EPT']]@meta.features)
## data frame with 0 columns and 6 rows
# Annotate the EPTs with predefined annotation file
obj <- LoadEPTanno(file="epts_anno.bed", object=obj)
## Working on assay EPT
## Intersect 245521 features.
# Gene name and type now added
head(obj[['EPT']]@meta.features)
##                       chr  start    end                 name strand n_gene
## chr1:14665-14787/-   chr1  14665  14787   chr1:14665-14787/-      -      1
## chr1:134935-135281/- chr1 134935 135281 chr1:134935-135281/-      -      1
## chr1:629103-629323/+ chr1 629103 629323 chr1:629103-629323/+      +      1
## chr1:629906-630021/- chr1 629906 630021 chr1:629906-630021/-      -      1
## chr1:629906-630025/+ chr1 629906 630025 chr1:629906-630025/+      +      1
## chr1:632266-632703/+ chr1 632266 632703 chr1:632266-632703/+      +      1
##                            gene_name       type
## chr1:14665-14787/-            WASH7P     intron
## chr1:134935-135281/- ENSG00000268903 exonintron
## chr1:629103-629323/+        MTND1P23       exon
## chr1:629906-630021/- ENSG00000230021     intron
## chr1:629906-630025/+        MTND2P28       exon
## chr1:632266-632703/+        MTCO1P12 exonintron
obj <- RunAutoCorr(obj)
## Working on assay : EPT
## Build weights on pca
## Run autocorrelation test for 245677 features.
obj <- SetAutoCorrFeatures(obj, moransi.min = 0.1)
## 17316 autocorrelated features.
obj <- RunBlockCorr(obj, block.name = "gene_name", block.assay = "RNA", threads=8, perm=100)
## Working on assay EPT
## Working on 17316 features.
## Build weights on pca
## Processing 7796 blocks..
## Trying to retrieve data from assay RNA..
## Test dissimlarity of two processes ..
## Runtime : 1.711828 mins
# Gene name and type now added
head(obj[['EPT']]@meta.features)
##                       chr  start    end                 name strand n_gene
## chr1:14665-14787/-   chr1  14665  14787   chr1:14665-14787/-      -      1
## chr1:134935-135281/- chr1 134935 135281 chr1:134935-135281/-      -      1
## chr1:629103-629323/+ chr1 629103 629323 chr1:629103-629323/+      +      1
## chr1:629906-630021/- chr1 629906 630021 chr1:629906-630021/-      -      1
## chr1:629906-630025/+ chr1 629906 630025 chr1:629906-630025/+      +      1
## chr1:632266-632703/+ chr1 632266 632703 chr1:632266-632703/+      +      1
##                            gene_name       type     MoransI MoransI.pval
## chr1:14665-14787/-            WASH7P     intron 0.003183699 2.906980e-01
## chr1:134935-135281/- ENSG00000268903 exonintron 0.112081399 3.187695e-75
## chr1:629103-629323/+        MTND1P23       exon 0.009076548 6.354737e-02
## chr1:629906-630021/- ENSG00000230021     intron 0.007448170 1.061480e-01
## chr1:629906-630025/+        MTND2P28       exon 0.054112384 5.234265e-19
## chr1:632266-632703/+        MTCO1P12 exonintron 0.035879924 2.177869e-09
##                      AutoCorrFeature gene_name.D gene_name.r gene_name.pval
## chr1:14665-14787/-             FALSE          NA          NA             NA
## chr1:134935-135281/-            TRUE          NA          NA             NA
## chr1:629103-629323/+           FALSE          NA          NA             NA
## chr1:629906-630021/-           FALSE          NA          NA             NA
## chr1:629906-630025/+           FALSE          NA          NA             NA
## chr1:632266-632703/+           FALSE          NA          NA             NA
# Plot Genome wide feature binding test plot
FbtPlot(obj, val = "gene_name.pval")

# Color by EPT types
type.cols <- c("#131313", "blue", "#A6CEE3", "#1F78B4", "#B2DF8A", "#33A02C", "#FB9A99", "#E31A1C", "#FDBF6F", "#FF7F00", "#CAB2D6", "#6A3D9A", "#FFFF99", "#B15928")
names(type.cols) <- c("antisense_complex", "antisense_exon", "antisense_intron", "antisense_utr3", "antisense_utr5", "exon","exonintron", "intergenic", "intron", "multiexons", "multigenes","utr3", "utr5", "whole_gene")
FbtPlot(obj, val = "gene_name.pval", col.by = "type", cols = type.cols, size=2)

df <- obj[['EPT']]@meta.features
# use %>% from dplyr 
df %>% filter(gene_name.pval < 1e-10) %>% datatable(options = list(scrollX = T))
# Normalize EPT counts for visulization
obj <- NormalizeData(obj)

FeaturePlot(obj, features = c("chr19:16095200-16095559/+", "TPM4"), order=TRUE, pt.size = 2)
## Warning: Could not find TPM4 in the default search locations, found in RNA
## assay instead

plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="TPM4",cell.group =  Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(16095200,16095559), type.col = type.cols)
## Chromosome chr19, start 16066021, end 16104002
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

# Capped max normalized depth to 1000
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="TPM4",cell.group =  Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(16095200,16095559), type.col = type.cols, max.depth = 1000)
## Chromosome chr19, start 16066021, end 16104002
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

FeaturePlot(obj, features = c("chr2:27889096-27890852/-", "RBKS"), order = TRUE, pt.size = 2)
## Warning: Could not find RBKS in the default search locations, found in RNA
## assay instead

plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="RBKS",cell.group =  Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(27889096,27890852), type.col = type.cols)
## Chromosome chr2, start 27780379, end 27891681
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

# Log-scale normalized UMI depth
plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="RBKS",cell.group =  Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(27889096,27890852), type.col = type.cols, log.scaled = TRUE)
## Chromosome chr2, start 27780379, end 27891681
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

5. Comparsion of Exon and EPT analysis

exon.meta <-obj[['EXON']]@meta.features
ept.meta <- obj[['EPT']]@meta.features
alt.exon <- exon.meta %>% filter(gene_name.pval < 0.01) %>% pull(gene_name) %>% unique()
alt.ept <- ept.meta %>% filter(gene_name.pval < 0.01 & type %in% c("exon", "utr3", "utr5", "exonintron", "multiexons")) %>% pull(gene_name) %>% unique()
require(ggvenn)
ggvenn(list("EXON"=alt.exon, "EPT"=alt.ept),text_size = 8)

setdiff(alt.exon, alt.ept)
## [1] "ENSG00000288663" "MYL6"            "SERF2"           "RPL17-C18orf32" 
## [5] "ENSG00000268173"
setdiff(alt.ept, alt.exon)
##  [1] "PDE4B"           "NME7"            "BAG3"            "CD163L1"        
##  [5] "SIPA1L1"         "SNAP23"          "ANXA2"           "ENSG00000285505"
##  [9] "C5AR1"           "RBKS"            "MXD1"            "NFE2L2"         
## [13] "CREB1"           "NFATC2"          "RPL15"           "MAPK10"         
## [17] "PHACTR1"         "DDAH2"           "SOD2"            "PILRA"          
## [21] "HAS2-AS1"
exon.meta %>% filter(gene_name %in% setdiff(alt.exon, alt.ept) & gene_name.pval < 0.01)  %>% datatable(options = list(scrollX = T))
DefaultAssay(obj) <- "EXON"
FeaturePlot(obj, features = c("chr12:53303034-53306861/+/ENSG00000288663","chr12:53306680-53306861/+/ENSG00000288663","ENSG00000288663"), order = TRUE, pt.size = 2, ncol=3)
## Warning: Could not find ENSG00000288663 in the default search locations, found
## in RNA assay instead

FeaturePlot(obj, features = c("chr12:53303034-53306861/+/ENSG00000288663","chr12:53306680-53306861/+/ENSG00000288663","ENSG00000288663"), order = TRUE, pt.size = 2, ncol=3, cols=c("blue","white","red"))
## Warning: Could not find ENSG00000288663 in the default search locations, found
## in RNA assay instead

plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="ENSG00000288663",cell.group =  Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(53303034,53306861), type.col = type.cols, max.depth = 1000)
## Chromosome chr12, start 53294492, end 53308174
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

FeaturePlot(obj, features = c("chr12:56161387-56161465/+/MYL6", "MYL6"), order=TRUE, pt.size=1)
## Warning: Could not find MYL6 in the default search locations, found in RNA
## assay instead

plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="MYL6",cell.group =  Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(56161387,56161465), type.col = type.cols, max.depth = 8000)
## Chromosome chr12, start 56157346, end 56164496
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

ept.meta %>% filter(gene_name %in% setdiff(alt.ept,alt.exon) & gene_name.pval < 0.01 & type %in% c("exon","utr3","utr5","exonintron","multiexons")) %>% datatable(options = list(scrollX = T))
DefaultAssay(obj) <- "EPT"
FeaturePlot(obj, features = c("chr1:65992643-65994170/+", "PDE4B"), pt.size = 2, order = TRUE)
## Warning: Could not find PDE4B in the default search locations, found in RNA
## assay instead

plotTracks(bamfile="./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam", db=db, gene="PDE4B",cell.group =  Idents(obj), meta.features = obj[['EPT']]@meta.features, highlights = c(65992643,65994170), type.col = type.cols, max.depth = 100)
## Chromosome chr1, start 65791514, end 66375579
## Process ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

6. EAT analysis

The following EAT analysis is based on the genetic variants. Therefore, we need to call variants from RNA reads first. The genetic variants in RNA sequence may come from three inputs:

[1] Germline variants; [2] Somatic mutations; [3] RNA editings.

Tips: for inbreeding mouse species, genetic variants are likely come from somatic mutation and RNA editings, considering very few genetic heterogeneity variants.

The following codes run in Terminal.

# download genome reference
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.p14.genome.fa.gz 
gzip -d /GRCh38.p14.genome.fa.gz 
# Variant calling. Here I use bcftools to call variants, other tools like GATK also appliable. This step usually be very slow.
# Donot filter variants, use raw VCF for downstream analysis. Because we are trying to label reads and generate EAT counts,
# low quality variants will be filtered during downstream QC.
bcftools mpileup -Ou -f ./GRCh38.p14.genome.fa  ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam -d 10000 | bcftools call -vmO z -o var.vcf.gz

PISA anno -vcf var.vcf.gz -vcf-ss -ref-alt -vtag VF -o anno.bam ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

mkdir var
PISA count -tag CB -umi UB -anno-tag VF -outdir var anno.bam
ls var/
## barcodes.tsv.gz
## features.tsv.gz
## matrix.mtx.gz

The following codes run in R.

var <- ReadPISA("./var")
obj[['VAR']] <- CreateAssayObject(var[,colnames(obj)], min.cells = 10)
DefaultAssay(obj) <- "VAR"

obj <- NormalizeData(obj)
obj <- LoadVARanno(object = obj, file = "epts_anno.bed")
## Working on assay VAR
head(obj[['VAR']]@meta.features)
##                   chr  start strand         locus                  ept
## chr1:631862G>A/+ chr1 631862      + chr1:631862/+ chr1:631822-631943/+
## chr1:633192G>A/+ chr1 633192      + chr1:633192/+ chr1:633103-633310/+
## chr1:633192G=/+  chr1 633192      + chr1:633192/+ chr1:633103-633310/+
## chr1:633236A>G/+ chr1 633236      + chr1:633236/+ chr1:633103-633310/+
## chr1:633300A=/+  chr1 633300      + chr1:633300/+ chr1:633103-633310/+
## chr1:633372T>C/+ chr1 633372      + chr1:633372/+ chr1:633365-633463/+
##                  gene_name   ept_type type
## chr1:631862G>A/+  MTCO1P12       exon  alt
## chr1:633192G>A/+  MTCO2P12       exon  alt
## chr1:633192G=/+   MTCO2P12       exon  ref
## chr1:633236A>G/+  MTCO2P12       exon  alt
## chr1:633300A=/+   MTCO2P12       exon  ref
## chr1:633372T>C/+  MTCO2P12 exonintron  alt
obj <- RunAutoCorr(obj, threads = 8)
## Working on assay : VAR
## Build weights on pca
## Run autocorrelation test for 112197 features.
obj <- SetAutoCorrFeatures(obj, p.cutoff = 0.001)
## 14881 autocorrelated features.
obj <- RunBlockCorr(object = obj, block.name = "locus", perm =100, threads = 8)
## Working on assay VAR
## Working on 14881 features.
## Build weights on pca
## Processing 5500 blocks..
## Aggregate counts..
## Test dissimlarity of two processes ..
## Runtime : 50.46148 secs
head(obj[['VAR']]@meta.features)
##                   chr  start strand         locus                  ept
## chr1:631862G>A/+ chr1 631862      + chr1:631862/+ chr1:631822-631943/+
## chr1:633192G>A/+ chr1 633192      + chr1:633192/+ chr1:633103-633310/+
## chr1:633192G=/+  chr1 633192      + chr1:633192/+ chr1:633103-633310/+
## chr1:633236A>G/+ chr1 633236      + chr1:633236/+ chr1:633103-633310/+
## chr1:633300A=/+  chr1 633300      + chr1:633300/+ chr1:633103-633310/+
## chr1:633372T>C/+ chr1 633372      + chr1:633372/+ chr1:633365-633463/+
##                  gene_name   ept_type type      MoransI MoransI.pval
## chr1:631862G>A/+  MTCO1P12       exon  alt -0.003823755  0.729139599
## chr1:633192G>A/+  MTCO2P12       exon  alt -0.002721279  0.666664863
## chr1:633192G=/+   MTCO2P12       exon  ref  0.004190248  0.232391666
## chr1:633236A>G/+  MTCO2P12       exon  alt  0.014472296  0.006542342
## chr1:633300A=/+   MTCO2P12       exon  ref  0.003431257  0.271878196
## chr1:633372T>C/+  MTCO2P12 exonintron  alt -0.003214307  0.696233522
##                  AutoCorrFeature locus.D locus.r locus.pval
## chr1:631862G>A/+           FALSE      NA      NA         NA
## chr1:633192G>A/+           FALSE      NA      NA         NA
## chr1:633192G=/+            FALSE      NA      NA         NA
## chr1:633236A>G/+           FALSE      NA      NA         NA
## chr1:633300A=/+            FALSE      NA      NA         NA
## chr1:633372T>C/+           FALSE      NA      NA         NA
FbtPlot(obj, val = "locus.pval")

obj[['VAR']]@meta.features %>% filter(locus.pval < 0.001)
##                      chr    start strand            locus
## chr10:17237326C=/+ chr10 17237326      + chr10:17237326/+
##                                          ept gene_name ept_type type   MoransI
## chr10:17237326C=/+ chr10:17236750-17238079/+       VIM     utr3  ref 0.6508988
##                    MoransI.pval AutoCorrFeature  locus.D   locus.r   locus.pval
## chr10:17237326C=/+            0            TRUE 0.677191 0.1854308 3.551327e-38
obj[['VAR']]@meta.features %>% filter(locus %in% c("chr10:17237326/+")) %>% rownames()
## [1] "chr10:17237326C>T/+" "chr10:17237326C=/+"
# Question:
# How to distinguish chr10:17237326C>T/+ is a somatic mutation or germline mutation or RNA editing?
FeaturePlot(obj, features = c("chr10:17237326C>T/+", "chr10:17237326C=/+" ), order=TRUE, pt.size = 2)

saveRDS(obj, file="glioblastoma5k.rds")

7. FAQ

8. Reference & Further reading

9. SessionInfo

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggvenn_0.1.10         ggplot2_3.4.4         ComplexHeatmap_2.15.3
##  [4] GSVA_1.40.1           msigdbr_7.5.1         DT_0.30              
##  [7] dplyr_1.1.3           Yano_0.0.0.9999       Matrix_1.6-1.1       
## [10] SeuratObject_4.1.4    Seurat_4.4.0         
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.4                  spatstat.explore_3.2-5     
##   [3] reticulate_1.34.0           R.utils_2.12.2             
##   [5] tidyselect_1.2.0            RSQLite_2.3.1              
##   [7] AnnotationDbi_1.56.2        htmlwidgets_1.6.2          
##   [9] BiocParallel_1.28.3         Rtsne_0.16                 
##  [11] ScaledMatrix_1.2.0          munsell_0.5.0              
##  [13] codetools_0.2-18            ica_1.0-3                  
##  [15] future_1.33.0               miniUI_0.1.1.1             
##  [17] withr_2.5.1                 spatstat.random_3.2-1      
##  [19] colorspace_2.1-0            progressr_0.14.0           
##  [21] Biobase_2.54.0              knitr_1.44                 
##  [23] rstudioapi_0.15.0           stats4_4.1.2               
##  [25] SingleCellExperiment_1.16.0 ROCR_1.0-11                
##  [27] tensor_1.5                  listenv_0.9.0              
##  [29] MatrixGenerics_1.6.0        labeling_0.4.3             
##  [31] GenomeInfoDbData_1.2.7      polyclip_1.10-6            
##  [33] bit64_4.0.5                 farver_2.1.1               
##  [35] pheatmap_1.0.12             rhdf5_2.38.1               
##  [37] parallelly_1.36.0           vctrs_0.6.4                
##  [39] generics_0.1.3              xfun_0.40                  
##  [41] doParallel_1.0.17           R6_2.5.1                   
##  [43] GenomeInfoDb_1.30.1         clue_0.3-64                
##  [45] ggbeeswarm_0.7.2            rsvd_1.0.5                 
##  [47] bitops_1.0-7                rhdf5filters_1.6.0         
##  [49] spatstat.utils_3.0-3        cachem_1.0.8               
##  [51] DelayedArray_0.20.0         promises_1.2.1             
##  [53] scales_1.2.1                beeswarm_0.4.0             
##  [55] gtable_0.3.4                beachmat_2.10.0            
##  [57] Cairo_1.6-1                 globals_0.16.2             
##  [59] goftest_1.2-3               rlang_1.1.1                
##  [61] GlobalOptions_0.1.2         splines_4.1.2              
##  [63] lazyeval_0.2.2              spatstat.geom_3.2-7        
##  [65] yaml_2.3.7                  reshape2_1.4.4             
##  [67] abind_1.4-5                 crosstalk_1.2.0            
##  [69] httpuv_1.6.12               tools_4.1.2                
##  [71] nabor_0.5.0                 ellipsis_0.3.2             
##  [73] jquerylib_0.1.4             RColorBrewer_1.1-3         
##  [75] BiocGenerics_0.40.0         ggridges_0.5.4             
##  [77] Rcpp_1.0.11                 plyr_1.8.9                 
##  [79] sparseMatrixStats_1.6.0     zlibbioc_1.40.0            
##  [81] purrr_1.0.2                 RCurl_1.98-1.12            
##  [83] deldir_1.0-9                GetoptLong_1.0.5           
##  [85] pbapply_1.7-2               cowplot_1.1.1              
##  [87] S4Vectors_0.32.4            zoo_1.8-12                 
##  [89] SummarizedExperiment_1.24.0 ggrepel_0.9.4              
##  [91] cluster_2.1.2               magrittr_2.0.3             
##  [93] magick_2.8.1                data.table_1.14.8          
##  [95] scattermore_1.2             circlize_0.4.15            
##  [97] lmtest_0.9-40               RANN_2.6.1                 
##  [99] fitdistrplus_1.1-11         matrixStats_1.0.0          
## [101] patchwork_1.1.3             mime_0.12                  
## [103] evaluate_0.22               xtable_1.8-4               
## [105] XML_3.99-0.14               shape_1.4.6                
## [107] IRanges_2.28.0              gridExtra_2.3              
## [109] compiler_4.1.2              tibble_3.2.1               
## [111] KernSmooth_2.23-20          crayon_1.5.2               
## [113] R.oo_1.25.0                 htmltools_0.5.6.1          
## [115] later_1.3.1                 tidyr_1.3.0                
## [117] DBI_1.1.3                   MASS_7.3-55                
## [119] babelgene_22.9              cli_3.6.1                  
## [121] R.methodsS3_1.8.2           parallel_4.1.2             
## [123] igraph_1.5.1                GenomicRanges_1.46.1       
## [125] pkgconfig_2.0.3             sp_2.1-1                   
## [127] plotly_4.10.3               spatstat.sparse_3.0-2      
## [129] foreach_1.5.2               annotate_1.70.0            
## [131] vipor_0.4.5                 bslib_0.5.1                
## [133] XVector_0.34.0              stringr_1.5.0              
## [135] digest_0.6.33               sctransform_0.4.1          
## [137] RcppAnnoy_0.0.21            graph_1.70.0               
## [139] spatstat.data_3.0-1         Biostrings_2.60.2          
## [141] rmarkdown_2.25              leiden_0.4.3               
## [143] uwot_0.1.16                 DelayedMatrixStats_1.16.0  
## [145] GSEABase_1.54.0             shiny_1.7.5.1              
## [147] gtools_3.9.4                rjson_0.2.21               
## [149] lifecycle_1.0.3             nlme_3.1-155               
## [151] jsonlite_1.8.7              Rhdf5lib_1.16.0            
## [153] viridisLite_0.4.2           limma_3.50.3               
## [155] fansi_1.0.5                 pillar_1.9.0               
## [157] lattice_0.20-45             ggrastr_1.0.2              
## [159] KEGGREST_1.34.0             fastmap_1.1.1              
## [161] httr_1.4.7                  survival_3.5-5             
## [163] glue_1.6.2                  iterators_1.0.14           
## [165] png_0.1-8                   bit_4.0.5                  
## [167] stringi_1.7.12              sass_0.4.7                 
## [169] HDF5Array_1.22.1            blob_1.2.4                 
## [171] BiocSingular_1.10.0         memoise_2.0.1              
## [173] irlba_2.3.5.1               future.apply_1.11.0